Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor parameter tolerance handling #114

Closed
wants to merge 5 commits into from
Closed

Conversation

AdrianSosic
Copy link
Collaborator

This PR refactors the tolerance implementation of numerical discrete parameters:

  • Allowed tolerance values are no longer upper bounded by half the distance between the closest two parameter values. The original rationale was that larger values would result in matching ambiguities. However, this rationale is based on the unnecessary assumption that the user always targets a specific one of the available parameter values for their experiments. In practice, it doesn't matter if this was the case or not. Instead, we simply ingest the values we observe, no matter how the experiment was planned.
  • add_measurements now takes an on_tolerance_violation parameter which allow to control to behavior when encountering out-of-range measurements

So overall, the mechanism remains similar to the previous one but more user-friendly.

Before, in-range values were defined by the distance to the closest parameter point. Now, everything inside the convex hull is automatically considered in range.
@AdrianSosic AdrianSosic added the enhancement Expand / change existing functionality label Feb 6, 2024
@AdrianSosic AdrianSosic self-assigned this Feb 6, 2024
@AdrianSosic
Copy link
Collaborator Author

@Scienfitz @AVHopp: This is my first proposal for the tolerance implementation. Note that the new version does NOT (yet) include a second tolerance value, i.e. it does not distinguish between the "matching tolerance" and the "out of convex hull tolerance". The reason is simply because I'm not yet sure if the other approach adds too much complexity for a user, and perhaps we should postpone the second tolerance discussion until we have finally fixed the scaling (which is affected by that decision). With the current minimal implementation, users can still do anything they want. That is, small deviations are automatically accounted for due to the new tolerance default value, and for larger deviations, the can actively decide to ignore the errors by setting the corresponding flag.

But let me know what you think

)
if not param.is_in_range((val := row[param.name])):
if param.is_numeric and on_tolerance_violation == "ignore":
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldnt this be a continue? still need to checkt he other parameters in the list

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also think that this needs to be continue

f"The value '{val}' is outside the range of parameter "
f"'{param.name}'. "
f"If you expected a match between your input "
f"and the parameter, consider increasing the parameter's "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this error is shown for both numerical and non-numerical parameters, hence should make the possible distinct situations clearer. Eg a categorical parameter has no tolerance

f"and the parameter, consider increasing the parameter's "
f"tolerance value or adding more parameter values. "
f"You can bypass this check using the 'ignore' or 'warn' mode."
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i cant comment on the line, but we shouldnt forget about whats below here in line 411+

  1. it still uses old logger mechanism
  2. the warnings there should also be controlled by on_tolerance_validation. eg no need to print a no match found warning if its set to ignore

@@ -159,7 +159,7 @@ def validate_config(cls, config_json: str) -> None:
def add_measurements(
self,
data: pd.DataFrame,
numerical_measurements_must_be_within_tolerance: bool = True,
on_tolerance_violation: Literal["raise", "warn", "ignore"] = "raise",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since these violations are not an issue for scaling, I would go with warn as default, not raise. This is also needed in the API I think.

The user can explicitly still disallow violations but setting raise which would be similar to explicitly setting the old numerical_measurements_must_be_within_tolerance to True

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

Copy link
Collaborator

@AVHopp AVHopp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly minor changes.

@@ -159,7 +159,7 @@ def validate_config(cls, config_json: str) -> None:
def add_measurements(
self,
data: pd.DataFrame,
numerical_measurements_must_be_within_tolerance: bool = True,
on_tolerance_violation: Literal["raise", "warn", "ignore"] = "raise",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

@@ -226,12 +228,13 @@ def add_measurements(
)

# Telemetry
# TODO: Code is inefficient because of unnecessary second fuzzy matching
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which unnecessary second fuzzy matching are you talking about here? Is this something happening within one of the functions?

)
@tolerance.default
def default_tolerance(self) -> float:
"""Set the tolerance to fraction of the smallest value distance."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads as if the fraction can be chosen by the user. Also, we should add the information in the docstring that this fraction is set to 0.1

parameters are matched with the search space elements only if there is a
match within the parameter tolerance. If ``False``, the closest match is
considered, irrespective of the distance.
on_tolerance_violation: The mode determining what how to handle a missing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either add some words about the different modes here or refer to the place where you described them.

)
if not param.is_in_range((val := row[param.name])):
if param.is_numeric and on_tolerance_violation == "ignore":
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also think that this needs to be continue

@Scienfitz Scienfitz added the on hold PR progress is awaiting for something else to continue label Apr 3, 2024
@Scienfitz Scienfitz marked this pull request as draft June 19, 2024 08:22
@AdrianSosic AdrianSosic mentioned this pull request Jul 22, 2024
@AdrianSosic AdrianSosic mentioned this pull request Aug 28, 2024
@Scienfitz
Copy link
Collaborator

Closing this PR due to inactivity
Further alignment seems to be needed
The problem of tolerance refactoring is now tracked here #375

@Scienfitz Scienfitz closed this Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Expand / change existing functionality on hold PR progress is awaiting for something else to continue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants